PDF (Portable Document Format) is one of the most pervasive formats for distributing text — research articles, government reports, historical documents, scanned books, and court judgements are routinely archived as PDFs. For linguistic and computational research, the ability to extract clean, machine-readable text from PDFs is therefore a fundamental data-preparation skill. This tutorial shows how to do that efficiently and reliably in R.
Two complementary packages are covered:
pdftools — fast, dependency-light extraction for digitally generated PDFs (PDFs rendered from Word, LaTeX, InDesign, or a web browser, where the text layer is embedded in the file)
tesseract — slower but more robust OCR (Optical Character Recognition) for image-based PDFs, scanned documents, faxes, and any PDF where the text is stored as a raster image rather than as selectable characters
The tutorial also covers how to extract document metadata and page-level information with pdftools, how to configure the tesseract engine for different languages, how to handle multi-page scanned PDFs, and how to combine OCR output with automated spell-checking and suggested-correction workflows using hunspell.
Prerequisite Tutorials
Before working through this how-to, familiarity with the following is recommended:
Schweinberger, Martin. 2026. Converting PDFs to Text with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/pdf2txt/pdf2txt.html (Version 2026.02.24).
pdftools vs tesseract: Choosing the Right Tool
Before writing a single line of code, the most important decision is which extraction tool to use. The answer depends entirely on how the PDF was created.
Choosing between pdftools and tesseract
Situation                                                   Recommended tool
PDF rendered from Word, LaTeX, or InDesign                  pdftools
PDF saved from a web browser or exported from software      pdftools
Scanned physical document (book, report, fax)               tesseract
PDF of a photograph or image                                tesseract
PDF with embedded fonts but garbled character encoding      tesseract
Mixed PDF (some pages have text layer, others are scanned)  tesseract for scanned pages; pdftools for text pages
How to tell which type you have: Open the PDF in a PDF viewer and try to select and copy a word. If you can select individual characters and the copied text is legible, the PDF has an embedded text layer and pdftools is the right choice. If selecting text is impossible, produces garbled output, or selects only whole blocks, the PDF is image-based and you need tesseract.
Quick Diagnostic in R
Code
# A quick way to check whether a PDF has a usable text layer:
# if nchar() returns 0 or near-zero for all pages, use tesseract instead
library(pdftools)
test <- pdftools::pdf_text("your_file.pdf")
nchar(test)  # characters per page; 0 means no text layer
Setup
Installing Packages
Code
# Run once to install; comment out after installation
install.packages("pdftools")
install.packages("tesseract")
install.packages("tidyverse")
install.packages("here")
install.packages("hunspell")
install.packages("flextable")
System Dependencies for tesseract
The tesseract R package is a wrapper around the Tesseract OCR engine, which must be installed separately as a system library before the R package will work.
Windows: Download and run the installer from github.com/UB-Mannheim/tesseract/wiki. After installation, make sure the Tesseract binary folder is on your system PATH.
macOS: Run brew install tesseract in a Terminal (requires Homebrew).
Linux (Debian/Ubuntu): Run sudo apt-get install tesseract-ocr in a Terminal.
Additional language packs for non-English OCR can be installed separately — see the Language Support section below.
Loading Packages
Code
library(pdftools)   # text-layer PDF extraction and metadata
library(tesseract)  # OCR for image-based PDFs
library(tidyverse)  # dplyr, stringr, purrr
library(here)       # portable file paths
library(hunspell)   # spell checking and correction
library(flextable)  # formatted display tables

# Initialise the English Tesseract engine once and reuse it
eng <- tesseract::tesseract("eng")
Data and Folder Setup
The code in this tutorial assumes that the sample PDFs are stored in a data/PDFs/ folder and that extracted texts are written to a data/txts/ folder inside your R project directory.
Download the four sample PDF files from the links below and save them in data/PDFs/: pdf0 · pdf1 · pdf2 · pdf3
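If the folders do not exist yet, they can be created from within R. This is a minimal sketch; adjust the paths if your project nests them differently (the code chunks in this tutorial use tutorials/pdf2txt/data/PDFs).

```r
# Create the input and output folders (no-op if they already exist)
dir.create(here::here("data", "PDFs"), recursive = TRUE, showWarnings = FALSE)
dir.create(here::here("data", "txts"), recursive = TRUE, showWarnings = FALSE)
```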
Text Extraction with pdftools
Section Overview
What you’ll learn: How to extract text from a single PDF and from a folder of PDFs using pdftools, how to retrieve document metadata and page-level information, how to work with page numbers, and how to save extracted text to disk
The pdftools package (Ooms 2022) provides fast, dependency-light text extraction for PDFs that have an embedded text layer. It wraps the Poppler PDF rendering library, which is bundled with the package on Windows and macOS, so no separate system installation is required on those platforms.
Extracting Text from a Single PDF
The workhorse function is pdftools::pdf_text(). It returns a character vector with one element per page; we paste the pages together and collapse any internal whitespace with str_squish().
Code
# Path to the PDF (Wikipedia article on corpus linguistics)
pdf_path <- "tutorials/pdf2txt/data/PDFs/pdf0.pdf"

# Extract text: one element per page
pages <- pdftools::pdf_text(pdf_path)
cat("Pages extracted:", length(pages), "\n")
Pages extracted: 2
Code
# Collapse all pages into a single string and clean whitespace
txt_output <- pages |>
  paste0(collapse = " ") |>
  stringr::str_squish()
substr(txt_output, 1, 1000)
Corpus linguistics - Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics Corpus linguistics Corpus linguistics is the study of language as expressed in corpora (samples) of "real world" text. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field in its natural context ("realia"), and with minimal experimental-interference. The field of corpus linguistics features divergent views about the value of corpus annotation. These views range from John McHardy Sinclair, who advocates minimal annotation so texts speak for themselves,[1] to the Survey of English Usage team (University College, London), who advocate annotation as allowing greater linguistic understanding through rigorous recording.[2] The text-corpus method is a digestive approach that derives a set of abstract rules that govern a natural language from texts in that language, and explores how that language relates to other languages. Originally derived manually, cor
Working Page by Page
Sometimes it is more useful to keep the page structure rather than collapsing everything into one string — for instance, when you need to track which page a quote came from, or when processing very large PDFs that would be unwieldy as a single string. In that case, work with the pages vector directly:
Code
# Process pages individually: clean each page separately
pages_clean <- pages |>
  purrr::map_chr(stringr::str_squish)

# Inspect the second page
cat(pages_clean[2])

# Create a data frame with one row per page
page_df <- data.frame(
  page = seq_along(pages_clean),
  text = pages_clean
)
Extracting Document Metadata
pdftools::pdf_info() returns a rich list of document metadata: title, author, creation date, modification date, PDF version, page dimensions, and more. This information is useful for provenance tracking and for verifying that you have the right document.
pdftools::pdf_pagesize() returns the width and height of each page (in points, where 1 point = 1/72 inch). This is useful for detecting mixed-orientation documents (portrait and landscape pages) or for understanding the physical layout of tables and figures.
pdftools::pdf_fonts() lists the fonts embedded in the document — helpful for diagnosing encoding problems or unusual character sets.
Code
# Page dimensions (in points)
page_sizes <- pdftools::pdf_pagesize(pdf_path)
head(page_sizes, 3)
top right bottom left width height
1 0 612 792 0 612 792
2 0 612 792 0 612 792
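The metadata functions mentioned above can be queried in the same way. A short sketch using the pdf_path defined earlier (field names follow the pdftools documentation):

```r
# Document-level metadata: page count, dates, PDF version, and info keys
info <- pdftools::pdf_info(pdf_path)
info$pages    # number of pages
info$created  # creation timestamp
info$keys     # named list: Title, Author, Producer, ...

# Fonts embedded in the document (useful for diagnosing encoding problems)
pdftools::pdf_fonts(pdf_path)
```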
For batch processing, we write a reusable function that takes a folder path, finds all PDF files, extracts and cleans their text, and returns a named character vector.
Code
convertpdf2txt <- function(dirpath, pattern = "\\.pdf$") {
  files <- list.files(dirpath, pattern = pattern,
                      full.names = TRUE, ignore.case = TRUE)
  if (length(files) == 0) stop("No PDF files found in: ", dirpath)
  texts <- sapply(files, function(f) {
    pdftools::pdf_text(f) |>
      paste0(collapse = " ") |>
      stringr::str_squish()
  }, USE.NAMES = TRUE)
  # Use clean base names (without path and extension) as names
  names(texts) <- tools::file_path_sans_ext(basename(files))
  return(texts)
}

# Apply the function to the folder of sample PDFs
txts <- convertpdf2txt("tutorials/pdf2txt/data/PDFs")
Corpus linguistics - Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics Corpus linguistics Corpus linguistics is the study of language as expressed in corpora (samples) of "real world" text. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field in its natural context ("realia"), and with minimal experimental-interference. The field of corpus linguistics features divergent views about the value of corpus annotation. These views range from John McHardy Sinclair, who advocates minimal annotation so texts speak for themselves,[1] to the Survey of English Usage team (University College, London), who advocate annotation as allowing greater linguistic understanding through rigorous recording.[2] The text-corpus method is a digesti
Language - Wikipedia https://en.wikipedia.org/wiki/Language Language A language is a structured system of communication. Language, in a broader sense, is the method of communication that involves the use of – particularly human – languages.[1][2][3] The scientific study of language is called linguistics. Questions concerning the philosophy of language, such as whether words can represent experience, have been debated at least since Gorgias and Plato in ancient Greece. Thinkers such as Rousseau have argued that language originated from emotions while others like Kant have held that it originated from rational and logical thought. 20th-century philosophers such as Wittgenstein argued that philosophy is really the study of language. Major figures in linguistics include Ferdinand de Saussure a
Natural language processing - Wikipedia https://en.wikipedia.org/wiki/Natural_language_processing Natural language processing Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. Contents History Rule-based vs. statistical NLP Major evaluations and tasks Syntax Semantics An automated online assistant Discourse providing customer service on a Speech web page, an example of an Dialogue ap
Computational linguistics - Wikipedia https://en.wikipedia.org/wiki/Computational_linguistics Computational linguistics Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions. Traditionally, computational linguistics was performed by computer scientists who had specialized in the application of computers to the processing of a natural language. Today, computational linguists often work as members of interdisciplinary teams, which can include regular linguists, experts in the target language, and computer scientists. In general, computational linguistics draws upon the involvement of linguists, compu
Building a Page-Level Data Frame
For downstream corpus analysis it is often useful to have a tidy data frame with one row per page, carrying the document name and page number alongside the text. This structure integrates naturally with dplyr pipelines.
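A page-level corpus of this kind can be built with a small helper along the following lines. This is a sketch: build_page_corpus is an illustrative name, not a pdftools function.

```r
# Build a tidy data frame with one row per page across all PDFs in a folder
build_page_corpus <- function(dirpath, pattern = "\\.pdf$") {
  files <- list.files(dirpath, pattern = pattern,
                      full.names = TRUE, ignore.case = TRUE)
  purrr::map_dfr(files, function(f) {
    pages <- pdftools::pdf_text(f) |> purrr::map_chr(stringr::str_squish)
    data.frame(
      doc  = tools::file_path_sans_ext(basename(f)),
      page = seq_along(pages),
      text = pages
    )
  })
}

page_corpus <- build_page_corpus("tutorials/pdf2txt/data/PDFs")
```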
cat("Total pages across all documents:", nrow(page_corpus), "\n")
Total pages across all documents: 21
Saving Extracted Texts to Disk
Code
# Save each text as a .txt file in data/txts/
output_dir <- here::here("tutorials/pdf2txt/data/txts")
dir.create(output_dir, showWarnings = FALSE)
lapply(seq_along(txts), function(i) {
  out_path <- file.path(output_dir, paste0(names(txts)[i], ".txt"))
  writeLines(text = txts[[i]], con = out_path)
  message("Saved: ", out_path)
})
OCR with tesseract
Section Overview
What you’ll learn: How to perform OCR on image-based PDFs using tesseract, how to configure the OCR engine, how to use non-English language models, and how to handle multi-page scanned PDFs
The tesseract package (Ooms 2023) provides R bindings for Google’s Tesseract OCR engine, an open-source OCR system that supports over 100 languages. Unlike pdftools, which reads an embedded text layer, tesseract analyses the image content of each page and attempts to identify characters from their visual appearance. This makes it the right tool for scanned documents, photographs of text, and any PDF where the text is stored as pixels rather than as Unicode characters.
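The OCR outputs shown below were produced along the following lines. This is a sketch: because Tesseract reads images rather than PDFs, each page is first rendered to a 300-dpi PNG with pdftools::pdf_convert(); ocr_pdf is an illustrative helper name.

```r
# OCR a whole PDF: render every page to PNG, then OCR each image
ocr_pdf <- function(f, engine = eng, dpi = 300) {
  n <- pdftools::pdf_info(f)$pages
  pngs <- pdftools::pdf_convert(
    f, format = "png", dpi = dpi,
    filenames = file.path(tempdir(),
                          sprintf("%s_%03d.png",
                                  tools::file_path_sans_ext(basename(f)),
                                  seq_len(n)))
  )
  tesseract::ocr(pngs, engine = engine) |>
    paste0(collapse = " ") |>
    stringr::str_squish()
}

pdf_files <- list.files("tutorials/pdf2txt/data/PDFs",
                        pattern = "\\.pdf$", full.names = TRUE)
ocrs <- sapply(pdf_files, ocr_pdf)
names(ocrs) <- tools::file_path_sans_ext(basename(pdf_files))
```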
Corpus linguistics - Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics WIKIPEDIA e e e Corpus linguistics Corpus linguistics is the study of language as expressed in corpora (samples) of "real world" text. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field in its natural context ("realia"), and with minimal experimental-interference. The field of corpus linguistics features divergent views about the value of corpus annotation. These views range from John McHardy Sinclair, who advocates minimal annotation so texts speak for themselves, to the Survey of English Usage team (University College, London), who advocate annotation as allowing greater linguistic understanding through rigorous recording.|2! The text-corpus method
Language - Wikipedia https://en.wikipedia.org/wiki/Language WIKIPEDIA Language A language is a structured system of communication. Language, in a broader sense, is the method of communication that involves the use of — particularly human — languages. J[2II3] The scientific study of language is called linguistics. Questions concerning the philosophy of language, such as whether words can represent experience, have been g debated at least since Gorgias and Plato in ancient Greece. Thinkers such as Rousseau have argued that language originated from emotions while others like Kant have ~ ‘ held that it originated from rational and logical thought. 20th-century philosophers such as Wittgenstein argued that philosophy is really the study of language. Major Pay yi figures in linguistics include F
Natural language processing - Wikipedia https://en.wikipedia.org/wiki/Natural_ language processing WIKIPEDIA e Natural language processing Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions — between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. ce ee EI Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. —— oo - & Contents nae History eicseuuc Rule-based vs. statistical NLP chm Hi I'm your automated online ‘ter coremer s Major evaluations and tasks svat How may nt ou? f= kd Syntax iereii
Computational linguistics - Wikipedia https://en.wikipedia.org/wiki/Computational_ linguistics WIKIPEDIA e e e e Computational linguistics Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions. Traditionally, computational linguistics was performed by computer scientists who had specialized in the application of computers to the processing of a natural language. Today, computational linguists often work as members of interdisciplinary teams, which can include regular linguists, experts in the target language, and computer scientists. In general, computational linguistics draws upon the involvement
Language Support
By default, tesseract() uses the English language model ("eng"). For documents in other languages, you must first install the relevant Tesseract language pack and then initialise an engine with that language code.
Code
# List all language packs already installed on your system
tesseract::tesseract_info()$available

# Install additional language packs from within R
# (downloads the trained model data from the tesseract-ocr GitHub repository)
tesseract::tesseract_download("deu")      # German
tesseract::tesseract_download("fra")      # French
tesseract::tesseract_download("chi_sim")  # Chinese (simplified)
tesseract::tesseract_download("ara")      # Arabic
tesseract::tesseract_download("hin")      # Hindi (Devanagari)
Code
# Initialise an engine for a specific language
deu <- tesseract::tesseract("deu")
fra <- tesseract::tesseract("fra")

# OCR a German-language document
german_text <- tesseract::ocr("path/to/german_document.png", engine = deu)
Multi-Language Documents
If a document contains text in more than one language, you can initialise a combined engine by passing a +-separated language string:
Code
# Engine that handles both English and German
eng_deu <- tesseract::tesseract("eng+deu")
mixed_text <- tesseract::ocr("mixed_language_doc.png", engine = eng_deu)
Recognition accuracy decreases somewhat with combined engines compared to single-language engines, so use this only when necessary.
Engine Configuration Options
The tesseract engine exposes many configuration parameters via the options argument of tesseract::tesseract(). The most practically useful are:
Code
# Page segmentation modes (psm) control how Tesseract analyses page layout:
#  1 = Automatic page segmentation with OSD (orientation and script detection)
#  3 = Fully automatic page segmentation (default)
#  6 = Assume a single uniform block of text
# 11 = Sparse text: find as much text as possible in no particular order
# 13 = Raw line: treat the image as a single text line

# OCR engine modes (oem):
# 0 = Legacy Tesseract engine only
# 1 = Neural nets LSTM engine only (best for most modern documents)
# 2 = Legacy + LSTM engines combined
# 3 = Default (based on what is available)

# Example: configure for a clean single-column document
eng_clean <- tesseract::tesseract(
  language = "eng",
  options = list(
    tessedit_pageseg_mode = 6,    # single uniform block
    tessedit_ocr_engine_mode = 1  # LSTM only (most accurate)
  )
)

# Example: configure for sparse or noisy text (e.g. forms, tables)
eng_sparse <- tesseract::tesseract(
  language = "eng",
  options = list(tessedit_pageseg_mode = 11)
)
Choosing the Page Segmentation Mode
The default mode (psm = 3, fully automatic) works well for most documents with a standard single- or multi-column layout. Use psm = 6 for clean, uniform text blocks (academic papers, novels). Use psm = 11 for heavily fragmented layouts such as invoices, forms, or partially damaged scans. Use psm = 13 for single lines of text, such as captions or labels.
Handling Multi-Page Scanned PDFs
Scanned PDFs often contain many pages, each stored as a raster image. Because Tesseract operates on images rather than PDFs, each page must first be rendered to an image (for example with pdftools::pdf_convert()) before OCR can be run on it. For very large documents it can also be useful to process pages in parallel, or to save intermediate results to disk so that expensive OCR does not have to be re-run if the process is interrupted.
Code
# For a large scanned PDF: process page by page and save intermediate results
large_pdf <- "path/to/large_scanned_document.pdf"
output_dir <- here::here("data", "ocr_pages")
dir.create(output_dir, showWarnings = FALSE)

# Get total number of pages
n_pages <- pdftools::pdf_info(large_pdf)$pages
cat("Total pages:", n_pages, "\n")

# Process each page individually; save to disk as we go
for (i in seq_len(n_pages)) {
  out_file <- file.path(output_dir, sprintf("page_%04d.txt", i))
  # Skip pages already processed (allows resuming after interruption)
  if (file.exists(out_file)) next
  # Render page i to PNG first: tesseract::ocr() reads images, not PDF pages
  png_file <- pdftools::pdf_convert(large_pdf, format = "png", pages = i,
                                    dpi = 300,
                                    filenames = tempfile(fileext = ".png"))
  page_text <- tesseract::ocr(png_file, engine = eng)
  writeLines(page_text, con = out_file)
  if (i %% 10 == 0) message("Processed page ", i, " of ", n_pages)
}

# Reassemble all pages into a single text
page_files <- list.files(output_dir, pattern = "\\.txt$", full.names = TRUE)
full_text <- sapply(page_files, readLines) |>
  unlist() |>
  paste0(collapse = " ") |>
  stringr::str_squish()
Code
# Alternative: parallel processing with furrr (requires the furrr package)
# install.packages("furrr")
library(furrr)
future::plan(multisession, workers = 4)  # use 4 CPU cores

n_pages <- pdftools::pdf_info(large_pdf)$pages
page_texts <- furrr::future_map_chr(
  seq_len(n_pages),
  function(i) {
    # Render the page to PNG, then OCR it; the engine is initialised inside
    # the worker because engine objects cannot be shared across R sessions
    png_file <- pdftools::pdf_convert(large_pdf, format = "png", pages = i,
                                      dpi = 300,
                                      filenames = tempfile(fileext = ".png"))
    tesseract::ocr(png_file, engine = tesseract::tesseract("eng")) |>
      paste0(collapse = " ")
  },
  .progress = TRUE
)
full_text_parallel <- paste0(page_texts, collapse = " ")
OCR Is Slow
Tesseract processes approximately 1–5 pages per minute depending on page resolution, image quality, page segmentation mode, and hardware. A 200-page scanned book may take 40 minutes to an hour. Always save intermediate results page-by-page (as shown above) so that you can resume without reprocessing completed pages if R crashes or times out.
Pre-Processing Images to Improve OCR Accuracy
OCR accuracy depends heavily on image quality. For documents that produce poor results, pre-processing the page images before OCR can substantially improve recognition. The magick package (a wrapper around ImageMagick) provides the tools most commonly needed.
Code
# install.packages("magick")
library(magick)

improve_ocr <- function(pdf_path, engine = eng) {
  # Convert PDF pages to high-resolution PNG images
  imgs <- magick::image_read_pdf(pdf_path, density = 300)  # 300 dpi
  page_texts <- sapply(seq_along(imgs), function(i) {
    img <- imgs[i] |>
      magick::image_convert(type = "Grayscale") |>  # convert to greyscale
      magick::image_contrast(sharpen = 1) |>        # enhance contrast
      magick::image_despeckle() |>                  # remove noise
      magick::image_deskew(threshold = 40)          # straighten tilted pages
    # Run OCR on the pre-processed image
    tesseract::ocr(img, engine = engine)
  })
  paste0(page_texts, collapse = " ")
}

# Apply to a scanned PDF
clean_text <- improve_ocr("path/to/noisy_scan.pdf", engine = eng)
Most Impactful Pre-Processing Steps
In order of typical impact on OCR accuracy:
Resolution — scan/render at 300 dpi minimum; 400–600 dpi for small or degraded fonts
Deskew — correct page rotation introduced during scanning
Greyscale conversion — remove colour information that can confuse character detection
Contrast enhancement — improve separation between ink and background
Despeckle — remove noise from scanner sensors or damaged paper
Spell Checking and Correction
Section Overview
What you’ll learn: How to check OCR output for non-dictionary words using hunspell, how to generate and apply automated spelling suggestions, and how to identify and review the most frequent OCR errors
Even high-quality OCR produces errors — especially for degraded documents, unusual fonts, or non-standard layouts. Common OCR error patterns include: l mistaken for 1 or I, rn mistaken for m, cl mistaken for d, and hyphenated line-break artefacts. Automated spell-checking cannot catch all errors (particularly proper nouns, technical terms, or correctly spelled but contextually wrong words), but it is a fast and effective first pass for cleaning OCR output.
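As a quick illustration of the approach before the full workflow below, hunspell() flags the words of a string that are not in the dictionary. A toy example with OCR-style errors (exact results depend on the installed en_US dictionary):

```r
# hunspell() returns, for each input string, the tokens not found in the dictionary
bad <- hunspell::hunspell("The cornputer rnodel parses language data.")
bad[[1]]  # the OCR-style misspellings "cornputer" and "rnodel" are flagged
```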
Tokenising and Checking Spelling
hunspell::hunspell_parse() splits text into word tokens. hunspell::hunspell_check() returns TRUE for each token that is found in the dictionary and FALSE for each token that is not.
Code
# Tokenise OCR output into word vectors (one vector per document)
tokens_ocr <- lapply(ocrs, function(x) {
  hunspell::hunspell_parse(x, dict = hunspell::dictionary("en_US"))[[1]]
})

# How many tokens per document?
sapply(tokens_ocr, length)
# Check which tokens are in the dictionary
spelling_check <- lapply(tokens_ocr, function(toks) {
  data.frame(
    token = toks,
    correct = hunspell::hunspell_check(toks,
                                       dict = hunspell::dictionary("en_US"))
  )
})

# Proportion of correctly spelled tokens per document
sapply(spelling_check, function(x) round(mean(x$correct) * 100, 1))
Before applying any automated correction, it is worth inspecting the most frequent non-dictionary tokens. Many will be proper nouns, technical terms, or hyphenated compounds that are perfectly correct — these should be added to an ignore list rather than corrected.
Code
# Collect all non-dictionary tokens across all documents
all_errors <- lapply(spelling_check, function(x) {
  x$token[!x$correct]
}) |> unlist()

# Frequency table of the 20 most common non-dictionary tokens
error_freq <- sort(table(all_errors), decreasing = TRUE)
data.frame(
  token = names(error_freq),
  count = as.integer(error_freq)
) |>
  head(20) |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .5, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption(
    caption = "20 most frequent non-dictionary tokens across all OCR outputs."
  ) |>
  flextable::border_outer()
token       count
https         160
http           79
doi            71
www            71
edu            31
wikipedia      29
von            26
nih            21
Trask          21
ncbi           20
de             17
NLP            17
ae             16
ee             14
PMC            14
Awww           13
html           13
nim            13
pdf            13
PMID           13
Generating Spelling Suggestions
hunspell::hunspell_suggest() returns a list of candidate corrections for each non-dictionary token, ranked by edit distance from the input. We take the first (best) suggestion where one is available.
Code
# Get suggestions for the 20 most common errors
top_errors <- names(error_freq)[1:20]
suggestions <- hunspell::hunspell_suggest(
  top_errors,
  dict = hunspell::dictionary("en_US")
)

# Build a review table: original token + best suggestion
suggestion_df <- data.frame(
  token = top_errors,
  suggestion = sapply(suggestions, function(s) {
    if (length(s) == 0) NA_character_ else s[1]
  })
)

suggestion_df |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .5, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption(
    caption = "Top 20 non-dictionary tokens with best hunspell correction suggestion."
  ) |>
  flextable::border_outer()
token       suggestion
https       HTTP
http        HTTP
doi         dpi
www         WWW
edu         ed
wikipedia   Wikipedia
von         con
nih         NIH
Trask       Task
ncbi        cabin
de          DE
NLP         NIP
ae          eye
ee          i
PMC         PM
Awww        WWW
html        HTML
nim         min
pdf         PDF
PMID        MID
Always Review Before Applying
Automated suggestions should be reviewed before being applied. hunspell_suggest() picks candidates purely based on character edit distance — it has no knowledge of context and will frequently suggest plausible-looking but wrong corrections. For example, the OCR error cornputer might be correctly suggested as computer, but rnodels might be suggested as models or noodles with equal confidence. Always check the suggestion table manually and build a curated correction dictionary for your specific document type.
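One practical way to organise this review is to export the suggestion table to a CSV file, correct it by hand in a spreadsheet, and read it back in as a named correction vector. A sketch (the file name is arbitrary):

```r
# 1. Export the suggestion table for manual review
write.csv(suggestion_df, "ocr_corrections_to_review.csv", row.names = FALSE)

# 2. ... edit the 'suggestion' column by hand; delete rows that need no fix ...

# 3. Re-import the reviewed table as a named vector:
#    names = erroneous tokens, values = confirmed corrections
reviewed <- read.csv("ocr_corrections_to_review.csv")
correction_dict <- setNames(reviewed$suggestion, reviewed$token)
```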
Applying a Curated Correction Dictionary
The recommended workflow is to review the suggestion table, manually confirm or override each correction, and then apply the full set of corrections as a batch string replacement.
Code
# Define a curated correction dictionary after manual review
# (example entries; adjust based on your actual OCR errors)
correction_dict <- c(
  "cornputer"   = "computer",
  "languagc"    = "language",
  "analysls"    = "analysis",
  "iinguistics" = "linguistics",
  "processlng"  = "processing"
)

# Apply corrections to all OCR texts
apply_corrections <- function(text, dict) {
  for (wrong in names(dict)) {
    text <- stringr::str_replace_all(
      text,
      pattern = paste0("\\b", wrong, "\\b"),
      replacement = dict[[wrong]]
    )
  }
  return(text)
}

corrected_texts <- sapply(ocrs, apply_corrections, dict = correction_dict)
Simple Automated Correction (Aggressive Mode)
If you prefer a fully automated approach and are willing to accept some incorrect corrections in exchange for speed, the following pipeline replaces every non-dictionary token with the best available suggestion. Use this with caution on documents containing technical vocabulary, proper names, or non-standard spellings.
Code
# Automated correction: replace every non-dictionary token with best suggestion
# WARNING: will incorrectly "correct" proper nouns and technical terms
clean_ocrtext <- sapply(tokens_ocr, function(toks) {
  correct <- hunspell::hunspell_check(toks,
                                      dict = hunspell::dictionary("en_US"))
  suggs <- hunspell::hunspell_suggest(toks[!correct],
                                      dict = hunspell::dictionary("en_US"))
  # Replace non-dictionary tokens with first suggestion (if available)
  toks[!correct] <- sapply(suggs, function(s) {
    if (length(s) == 0) NA_character_ else s[1]
  })
  # Remove tokens for which no suggestion was found
  toks <- toks[!is.na(toks)]
  paste0(toks, collapse = " ")
})
substr(clean_ocrtext, 1, 800)
Corpus linguistics Wikipedia HTTP en Wikipedia org wiki Corpus linguistics WIKIPEDIA e e e Corpus linguistics Corpus linguistics is the study of language as expressed in corpora samples of real world text Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field in its natural context regalia and with minimal experimental interference The field of corpus linguistics features divergent views about the value of corpus annotation These views range from John Hardy Sinclair who advocates minimal annotation so texts speak for themselves to the Survey of English Usage team University College London who advocate annotation as allowing greater linguistic understanding through rigorous recording The text corpus method is a digestive approach tha
Language Wikipedia HTTP en Wikipedia org wiki Language WIKIPEDIA Language A language is a structured system of communication Language in a broader sense is the method of communication that involves the use of particularly human languages J II The scientific study of language is called linguistics Questions concerning the philosophy of language such as whether words can represent experience have been g debated at least since Gorgas and Plato in ancient Greece Thinkers such as Rousseau have argued that language originated from emotions while others like Kant have held that it originated from rational and logical thought ht century philosophers such as Wittgenstein argued that philosophy is really the study of language Major Pay ti figures in linguistics include Ferdinand DE Saussure and Nam
Natural language processing Wikipedia HTTP en Wikipedia org wiki Natural language processing WIKIPEDIA e Natural language processing Natural language processing NIP is a sub field of linguistics computer science information engineering and artificial intelligence concerned with the interactions between computers and human natural languages in particular how to program computers to process and analyze large amounts of natural language data Ce i A Challenges in natural language processing frequently involve speech recognition natural language understanding and natural language generation u Contents nae History eugenics Rule based vs statistical NIP chm Hi I'm your automated online yer corer s Major evaluations and tasks scat How may NT oi f ks Syntax portieres Semantics An automated online a
Computational linguistics Wikipedia HTTP en Wikipedia org wiki Computational linguistics WIKIPEDIA e e e e Computational linguistics Computational linguistics is an interdisciplinary field concerned with the statistical or rule based modeling of natural language from a computational perspective as well as the study of appropriate computational approaches to linguistic questions Traditionally computational linguistics was performed by computer scientists who had specialized in the application of computers to the processing of a natural language Today computational linguists often work as members of interdisciplinary teams which can include regular linguists experts in the target language and computer scientists In general computational linguistics draws upon the involvement of linguists com
Putting It All Together
Section Overview
What you’ll learn: A complete, production-ready workflow function that selects the appropriate extraction method (pdftools or tesseract), extracts text, and optionally applies spell correction — all in a single call
The code below wraps the full pipeline into a single reusable function. It accepts a path to a PDF or a directory of PDFs, detects whether each file has an embedded text layer (and falls back to tesseract if not), and optionally applies spell correction.
```r
#' Extract text from one or more PDFs, choosing the best method automatically
#'
#' @param path Path to a single PDF file or a directory containing PDFs
#' @param lang Tesseract language code (default: "eng")
#' @param spell_correct Apply automated spell correction to OCR output?
#' @param min_chars_per_page Minimum characters per page to consider text
#'   layer valid (below this, fall back to tesseract)
#' @return Named character vector of extracted texts
extract_pdf_text <- function(path,
                             lang = "eng",
                             spell_correct = FALSE,
                             min_chars_per_page = 50) {
  # Resolve input: single file or directory
  if (dir.exists(path)) {
    files <- list.files(path, pattern = "\\.pdf$",
                        full.names = TRUE, ignore.case = TRUE)
  } else if (file.exists(path)) {
    files <- path
  } else {
    stop("Path does not exist: ", path)
  }

  engine <- tesseract::tesseract(lang)

  results <- sapply(files, function(f) {
    # Try pdftools first; check whether the text layer is usable
    pages_raw <- pdftools::pdf_text(f)
    avg_chars <- mean(nchar(stringr::str_squish(pages_raw)))
    has_textlayer <- avg_chars >= min_chars_per_page

    if (has_textlayer) {
      message(basename(f), ": using pdftools (text layer detected)")
      txt <- pages_raw |>
        paste0(collapse = " ") |>
        stringr::str_squish()
    } else {
      message(basename(f), ": using tesseract (no usable text layer)")
      txt <- tesseract::ocr(f, engine = engine) |>
        paste0(collapse = " ") |>
        stringr::str_squish()

      if (spell_correct) {
        toks <- hunspell::hunspell_parse(txt,
                                         dict = hunspell::dictionary("en_US"))[[1]]
        correct <- hunspell::hunspell_check(toks,
                                            dict = hunspell::dictionary("en_US"))
        suggs <- hunspell::hunspell_suggest(toks[!correct],
                                            dict = hunspell::dictionary("en_US"))
        toks[!correct] <- sapply(suggs, function(s) {
          if (length(s) == 0) NA_character_ else s[1]
        })
        toks <- toks[!is.na(toks)]
        txt <- paste0(toks, collapse = " ")
      }
    }
    return(txt)
  }, USE.NAMES = TRUE)

  names(results) <- tools::file_path_sans_ext(basename(files))
  return(results)
}

# --- Usage examples ----------------------------------------------------------

# Single file — auto-detect method
text1 <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/pdf0.pdf")

# Directory — auto-detect method for each file
texts_auto <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/")

# Directory — force spell correction for OCR fallback files
texts_corrected <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/", spell_correct = TRUE)

# Non-English document
text_de <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/german_report.pdf", lang = "deu")
```
Summary
This how-to has covered the complete PDF-to-text workflow in R:
Choosing a tool. pdftools is the right choice for digitally generated PDFs with an embedded text layer — it is fast, requires no external dependencies beyond Poppler, and preserves the document’s pagination and layout. tesseract is the right choice for scanned documents and image-based PDFs — it is slower but handles content that pdftools cannot access at all.
Beyond basic extraction. pdftools also provides document metadata (pdf_info()), page dimensions (pdf_pagesize()), and font information (pdf_fonts()), all of which are useful for provenance tracking and diagnosing encoding problems. tesseract supports over 100 languages via downloadable language models and exposes configuration parameters for page segmentation mode and OCR engine selection that can significantly improve accuracy on challenging documents.
Pre-processing and spell correction. For noisy scans, pre-processing the page images with magick (greyscale conversion, contrast enhancement, deskewing, despeckling) before OCR substantially improves recognition accuracy. Post-OCR spell checking with hunspell identifies non-dictionary tokens and can generate correction candidates, but automated correction should always be reviewed manually before application — particularly for documents containing proper nouns, technical vocabulary, or non-standard spelling conventions.
Production-ready workflow. The extract_pdf_text() function presented in the final section wraps the full pipeline into a single call that automatically detects whether each PDF has a usable text layer and selects the appropriate extraction method accordingly.
Citation and Session Info
Schweinberger, Martin. 2026. Converting PDFs to Text with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/pdf2txt/pdf2txt.html (Version 2026.02.24).
@manual{schweinberger2026pdf2txt,
author = {Schweinberger, Martin},
title = {Converting PDFs to Text with R},
note = {https://ladal.edu.au/tutorials/pdf2txt/pdf2txt.html},
year = {2026},
organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
address = {Brisbane},
edition = {2026.02.24}
}
This how-to was revised and substantially expanded with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to restructure the document into Quarto format, add the pdftools vs tesseract comparison section, expand the pdftools section with metadata, page-level, and batch-processing examples, expand the tesseract section with language support and engine configuration, write the new multi-page scanned PDF and image pre-processing sections, expand the spell-checking section with a suggested-correction workflow and curated dictionary approach, and write the production-ready extract_pdf_text() wrapper function. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy of the material.
---title: "Converting PDFs to Text with R"author: "Martin Schweinberger"format: html: toc: true toc-depth: 4 code-fold: show code-tools: true theme: cosmo---```{r setup, echo=FALSE, message=FALSE, warning=FALSE}options(stringsAsFactors = FALSE)options("scipen" = 100, "digits" = 12)```{ width=100% }# Introduction {#intro}{ width=15% style="float:right; padding:10px" }PDF (*Portable Document Format*) is one of the most pervasive formats for distributing text — research articles, government reports, historical documents, scanned books, and court judgements are routinely archived as PDFs. For linguistic and computational research, the ability to extract clean, machine-readable text from PDFs is therefore a fundamental data-preparation skill. This tutorial shows how to do that efficiently and reliably in R.Two complementary packages are covered:- **`pdftools`** — fast, dependency-light extraction for digitally generated PDFs (PDFs rendered from Word, LaTeX, InDesign, or a web browser, where the text layer is embedded in the file)- **`tesseract`** — slower but more robust OCR (Optical Character Recognition) for image-based PDFs, scanned documents, faxes, and any PDF where the text is stored as a raster image rather than as selectable charactersThe tutorial also covers how to extract document metadata and page-level information with `pdftools`, how to configure the `tesseract` engine for different languages, how to handle multi-page scanned PDFs, and how to combine OCR output with automated spell-checking and suggested-correction workflows using `hunspell`.::: {.callout-note}## Prerequisite TutorialsBefore working through this how-to, familiarity with the following is recommended:- [Getting Started with R and RStudio](/tutorials/intror/intror.html)- [Loading, Saving, and Generating Data in R](/tutorials/load/load.html)- [String Processing in R](/tutorials/string/string.html):::::: {.callout-note}## CitationSchweinberger, Martin. 2026. *Converting PDFs to Text with R*. 
Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/pdf2txt/pdf2txt.html (Version 2026.02.24).:::---# pdftools vs tesseract: Choosing the Right Tool {#choosing}Before writing a single line of code, the most important decision is which extraction tool to use. The answer depends entirely on *how the PDF was created*.| Situation | Recommended tool ||-----------|-----------------|| PDF rendered from Word, LaTeX, or InDesign | `pdftools` || PDF saved from a web browser or exported from software | `pdftools` || Scanned physical document (book, report, fax) | `tesseract` || PDF of a photograph or image | `tesseract` || PDF with embedded fonts but garbled character encoding | `tesseract` || Mixed PDF (some pages have text layer, others are scanned) | `tesseract` for scanned pages; `pdftools` for text pages || Non-Latin script (Arabic, Chinese, Devanagari, etc.) | `tesseract` with the appropriate language model |: Choosing between pdftools and tesseract {tbl-colwidths="[65,35]"}**How to tell which type you have:** Open the PDF in a PDF viewer and try to select and copy a word. If you can select individual characters and the copied text is legible, the PDF has an embedded text layer and `pdftools` is the right choice. 
If selecting text is impossible, produces garbled output, or selects only whole blocks, the PDF is image-based and you need `tesseract`.::: {.callout-tip}## Quick Diagnostic in R```{r diagnose, eval=FALSE}# A quick way to check whether a PDF has a usable text layer:# if nchar() returns 0 or near-zero for all pages, use tesseract insteadlibrary(pdftools)test <- pdftools::pdf_text("your_file.pdf")nchar(test) # characters per page — 0 means no text layer```:::---# Setup {#setup}## Installing Packages {-}```{r prep1, echo=TRUE, eval=FALSE, message=FALSE, warning=FALSE}# Run once to install — comment out after installationinstall.packages("pdftools")install.packages("tesseract")install.packages("tidyverse")install.packages("here")install.packages("hunspell")install.packages("flextable")```::: {.callout-warning}## System Dependencies for tesseractThe `tesseract` R package is a wrapper around the [Tesseract OCR engine](https://opensource.google/projects/tesseract), which must be installed separately as a system library before the R package will work.- **Windows**: Download and run the installer from [github.com/UB-Mannheim/tesseract/wiki](https://github.com/UB-Mannheim/tesseract/wiki). 
After installation, make sure the Tesseract binary folder is on your system PATH.- **macOS**: Run `brew install tesseract` in a Terminal (requires [Homebrew](https://brew.sh/)).- **Linux (Debian/Ubuntu)**: Run `sudo apt-get install tesseract-ocr` in a Terminal.Additional language packs for non-English OCR can be installed separately — see the [Language Support](#language-support) section below.:::## Loading Packages {-}```{r prep2, echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE}library(pdftools) # text-layer PDF extraction and metadatalibrary(tesseract) # OCR for image-based PDFslibrary(tidyverse) # dplyr, stringr, purrrlibrary(here) # portable file pathslibrary(hunspell) # spell checking and correctionlibrary(flextable) # formatted display tables# Initialise the English Tesseract engine once and reuse iteng <- tesseract::tesseract("eng")```## Data and Folder Setup {-}The code in this tutorial assumes the following folder structure within your R project:```your_project/├── data/│ └── PDFs/│ ├── pdf0.pdf (Wikipedia: Corpus linguistics)│ ├── pdf1.pdf (Wikipedia: Linguistics)│ ├── pdf2.pdf (Wikipedia: Natural language processing)│ └── pdf3.pdf (Wikipedia: Computational linguistics)└── pdf2txt.qmd```Download the four sample PDF files from the links below and save them in `data/PDFs/`:[pdf0](/tutorials/pdf2txt/data/PDFs/pdf0.pdf) · [pdf1](/tutorials/pdf2txt/data/PDFs/pdf1.pdf) · [pdf2](/tutorials/pdf2txt/data/PDFs/pdf2.pdf) · [pdf3](/tutorials/pdf2txt/data/PDFs/pdf3.pdf)---# Text Extraction with pdftools {#pdftools}::: {.callout-note}## Section Overview**What you'll learn:** How to extract text from a single PDF and from a folder of PDFs using `pdftools`, how to retrieve document metadata and page-level information, how to work with page numbers, and how to save extracted text to disk:::The `pdftools` package [@ooms2022pdftools] provides fast, dependency-light text extraction for PDFs that have an embedded text layer. 
It wraps the [Poppler PDF rendering library](https://poppler.freedesktop.org/) and works without any external system dependencies beyond Poppler itself (which is bundled with the package on Windows and macOS).## Extracting Text from a Single PDF {-}The workhorse function is `pdftools::pdf_text()`. It returns a character vector with one element per page; we paste the pages together and collapse any internal whitespace with `str_squish()`.```{r pconv01, echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE}# Path to the PDF (Wikipedia article on corpus linguistics)pdf_path <- "tutorials/pdf2txt/data/PDFs/pdf0.pdf"# Extract text: one element per pagepages <- pdftools::pdf_text(pdf_path)cat("Pages extracted:", length(pages), "\n")# Collapse all pages into a single string and clean whitespacetxt_output <- pages |> paste0(collapse = " ") |> stringr::str_squish()``````{r pconv02, echo=FALSE, eval=TRUE, message=FALSE, warning=FALSE}txt_output |> substr(1, 1000) |> as.data.frame() |> flextable::flextable() |> flextable::set_table_properties(width = .95, layout = "autofit") |> flextable::theme_zebra() |> flextable::fontsize(size = 10) |> flextable::set_caption( caption = "First 1,000 characters extracted from the Wikipedia article on corpus linguistics (pdf0.pdf)." ) |> flextable::border_outer()```::: {.callout-tip}## Working Page by PageSometimes it is more useful to keep the page structure rather than collapsing everything into one string — for instance, when you need to track which page a quote came from, or when processing very large PDFs that would be unwieldy as a single string. 
In that case, work with the `pages` vector directly:```{r page_by_page, eval=FALSE}# Process pages individually: clean each page separatelypages_clean <- pages |> purrr::map_chr(stringr::str_squish)# Inspect the second pagecat(pages_clean[2])# Create a data frame with one row per pagepage_df <- data.frame( page = seq_along(pages_clean), text = pages_clean)```:::## Extracting Document Metadata {-}`pdftools::pdf_info()` returns a rich list of document metadata: title, author, creation date, modification date, PDF version, page dimensions, and more. This information is useful for provenance tracking and for verifying that you have the right document.```{r meta01, eval=TRUE, message=FALSE, warning=FALSE}meta <- pdftools::pdf_info(pdf_path)# Display selected metadata fieldsdata.frame( Field = c("Pages", "PDF version", "Title", "Author", "Creator", "Created", "Modified"), Value = c( meta$pages, meta$version, ifelse(is.null(meta$keys$Title), "—", meta$keys$Title), ifelse(is.null(meta$keys$Author), "—", meta$keys$Author), ifelse(is.null(meta$keys$Creator), "—", meta$keys$Creator), format(meta$created, "%Y-%m-%d %H:%M"), format(meta$modified, "%Y-%m-%d %H:%M") )) |> flextable::flextable() |> flextable::set_table_properties(width = .75, layout = "autofit") |> flextable::theme_zebra() |> flextable::fontsize(size = 10) |> flextable::set_caption(caption = "Document metadata extracted from pdf0.pdf.") |> flextable::border_outer()```## Extracting Page-Level Information {-}`pdftools::pdf_pagesize()` returns the width and height of each page (in points, where 1 point = 1/72 inch). 
This is useful for detecting mixed-orientation documents (portrait and landscape pages) or for understanding the physical layout of tables and figures.`pdftools::pdf_fonts()` lists the fonts embedded in the document — helpful for diagnosing encoding problems or unusual character sets.```{r meta02, eval=TRUE, message=FALSE, warning=FALSE}# Page dimensions (in points)page_sizes <- pdftools::pdf_pagesize(pdf_path)head(page_sizes, 3)``````{r meta03, eval=TRUE, message=FALSE, warning=FALSE}# Embedded fontsfonts <- pdftools::pdf_fonts(pdf_path)head(fonts, 6)```## Extracting Text from Many PDFs {-}For batch processing, we write a reusable function that takes a folder path, finds all PDF files, extracts and cleans their text, and returns a named character vector.```{r pconv03, eval=TRUE, message=FALSE, warning=FALSE}convertpdf2txt <- function(dirpath, pattern = "\\.pdf$") { files <- list.files(dirpath, pattern = pattern, full.names = TRUE, ignore.case = TRUE) if (length(files) == 0) stop("No PDF files found in: ", dirpath) texts <- sapply(files, function(f) { pdftools::pdf_text(f) |> paste0(collapse = " ") |> stringr::str_squish() }, USE.NAMES = TRUE) # Use clean base names (without path and extension) as names names(texts) <- tools::file_path_sans_ext(basename(files)) return(texts)}``````{r pconv05, eval=TRUE, message=FALSE, warning=FALSE}txts <- convertpdf2txt(here::here("tutorials/pdf2txt/data/PDFs"))cat("Texts extracted:", length(txts), "\n")cat("Names:", paste(names(txts), collapse = ", "), "\n")``````{r pconv06, echo=FALSE, eval=TRUE, message=FALSE, warning=FALSE}txts |> substr(1, 800) |> as.data.frame() |> flextable::flextable() |> flextable::set_table_properties(width = .95, layout = "autofit") |> flextable::theme_zebra() |> flextable::fontsize(size = 10) |> flextable::set_caption( caption = "First 800 characters of each extracted text (4 Wikipedia articles on language technology topics)." 
) |> flextable::border_outer()```## Building a Page-Level Data Frame {-}For downstream corpus analysis it is often useful to have a tidy data frame with one row per page, carrying the document name and page number alongside the text. This structure integrates naturally with `dplyr` pipelines.```{r page_df, eval=TRUE, message=FALSE, warning=FALSE}files <- list.files(here::here("tutorials/pdf2txt/data/PDFs"), pattern = "pdf$", full.names = TRUE, ignore.case = TRUE)page_corpus <- purrr::map_dfr(files, function(f) { pages <- pdftools::pdf_text(f) data.frame( document = tools::file_path_sans_ext(basename(f)), page = seq_along(pages), text = stringr::str_squish(pages), stringsAsFactors = FALSE )})head(page_corpus[, c("document", "page")], 8)cat("Total pages across all documents:", nrow(page_corpus), "\n")```## Saving Extracted Texts to Disk {-}```{r pconv07, eval=FALSE, message=FALSE, warning=FALSE}# Save each text as a .txt file in data/txts/output_dir <- here::here("tutorials/pdf2txt/data/txts")dir.create(output_dir, showWarnings = FALSE)lapply(seq_along(txts), function(i) { out_path <- file.path(output_dir, paste0(names(txts)[i], ".txt")) writeLines(text = txts[[i]], con = out_path) message("Saved: ", out_path)})```---# OCR with tesseract {#tesseract}::: {.callout-note}## Section Overview**What you'll learn:** How to perform OCR on image-based PDFs using `tesseract`, how to configure the OCR engine, how to use non-English language models, and how to handle multi-page scanned PDFs:::The `tesseract` package [@ooms2023tesseract] provides R bindings for [Google's Tesseract OCR engine](https://opensource.google/projects/tesseract), an open-source OCR system that supports over 100 languages. Unlike `pdftools`, which reads an embedded text layer, `tesseract` analyses the *image content* of each page and attempts to identify characters from their visual appearance. 
This makes it the right tool for scanned documents, photographs of text, and any PDF where the text is stored as pixels rather than as Unicode characters.## Basic OCR on a Folder of PDFs {-}```{r ocr01, eval=TRUE, message=FALSE, warning=FALSE}fls <- list.files(here::here("tutorials/pdf2txt/data/PDFs"), full.names = TRUE)ocrs <- sapply(fls, function(x) { nm <- tools::file_path_sans_ext(basename(x)) txt <- tesseract::ocr(x, engine = eng) |> paste0(collapse = " ") return(txt)}, USE.NAMES = TRUE)names(ocrs) <- tools::file_path_sans_ext(basename(fls))``````{r tesout1, echo=FALSE, eval=TRUE, message=FALSE, warning=FALSE}ocrs |> substr(1, 800) |> as.data.frame() |> flextable::flextable() |> flextable::set_table_properties(width = .95, layout = "autofit") |> flextable::theme_zebra() |> flextable::fontsize(size = 10) |> flextable::set_caption( caption = "First 800 characters of OCR output for each of the four Wikipedia article PDFs." ) |> flextable::border_outer()```## Language Support {#language-support}By default, `tesseract()` uses the English language model (`"eng"`). 
For documents in other languages, you must first install the relevant Tesseract language pack and then initialise an engine with that language code.```{r lang01, eval=FALSE, message=FALSE, warning=FALSE}# List all language packs already installed on your systemtesseract::tesseract_info()$available# Install additional language packs from within R# (downloads the trained model data from the tesseract-ocr GitHub repository)tesseract::tesseract_download("deu") # Germantesseract::tesseract_download("fra") # Frenchtesseract::tesseract_download("zho") # Chinese (simplified)tesseract::tesseract_download("ara") # Arabictesseract::tesseract_download("hin") # Hindi (Devanagari)``````{r lang02, eval=FALSE, message=FALSE, warning=FALSE}# Initialise an engine for a specific languagedeu <- tesseract::tesseract("deu")fra <- tesseract::tesseract("fra")# OCR a German-language PDFgerman_text <- tesseract::ocr("path/to/german_document.pdf", engine = deu)```::: {.callout-tip}## Multi-Language DocumentsIf a document contains text in more than one language, you can initialise a combined engine by passing a `+`-separated language string:```{r multilang, eval=FALSE}# Engine that handles both English and Germaneng_deu <- tesseract::tesseract("eng+deu")mixed_text <- tesseract::ocr("mixed_language_doc.pdf", engine = eng_deu)```Recognition accuracy decreases somewhat with combined engines compared to single-language engines, so use this only when necessary.:::## Engine Configuration Options {-}The `tesseract` engine exposes many configuration parameters via the `options` argument of `tesseract::tesseract()`. 
The most practically useful are:```{r engine_config, eval=FALSE, message=FALSE, warning=FALSE}# Page segmentation modes (psm) control how Tesseract analyses page layout:# 1 = Automatic page segmentation with OSD (orientation and script detection)# 3 = Fully automatic page segmentation (default)# 6 = Assume a single uniform block of text# 11 = Sparse text — find as much text as possible in no particular order# 13 = Raw line — treat the image as a single text line# OCR engine modes (oem):# 0 = Legacy Tesseract engine only# 1 = Neural nets LSTM engine only (best for most modern documents)# 2 = Legacy + LSTM engines combined# 3 = Default (based on what is available)# Example: configure for a clean single-column documenteng_clean <- tesseract::tesseract( language = "eng", options = list( tessedit_pageseg_mode = 6, # single uniform block tessedit_ocr_engine_mode = 1 # LSTM only (most accurate) ))# Example: configure for sparse or noisy text (e.g. forms, tables)eng_sparse <- tesseract::tesseract( language = "eng", options = list(tessedit_pageseg_mode = 11))```::: {.callout-note}## Choosing the Page Segmentation ModeThe default mode (psm = 3, fully automatic) works well for most documents with a standard single- or multi-column layout. Use psm = 6 for clean, uniform text blocks (academic papers, novels). Use psm = 11 for heavily fragmented layouts such as invoices, forms, or partially damaged scans. Use psm = 13 for single lines of text, such as captions or labels.:::## Handling Multi-Page Scanned PDFs {#multipage}Scanned PDFs often contain many pages, each stored as a raster image. `tesseract::ocr()` handles multi-page PDFs natively — it renders each page as an image internally and runs OCR on each in sequence. 
However, for very large documents it can be useful to process pages in parallel or to save intermediate results to disk to avoid having to re-run expensive OCR if the process is interrupted.```{r multipage01, eval=FALSE, message=FALSE, warning=FALSE}# For a large scanned PDF: process page by page and save intermediate resultslarge_pdf <- "path/to/large_scanned_document.pdf"output_dir <- here::here("data", "ocr_pages")dir.create(output_dir, showWarnings = FALSE)# Get total number of pagesn_pages <- pdftools::pdf_info(large_pdf)$pagescat("Total pages:", n_pages, "\n")# Process each page individually; save to disk as we gofor (i in seq_len(n_pages)) { out_file <- file.path(output_dir, sprintf("page_%04d.txt", i)) # Skip pages already processed (allows resuming after interruption) if (file.exists(out_file)) next page_text <- tesseract::ocr(large_pdf, engine = eng, pages = i) writeLines(page_text, con = out_file) if (i %% 10 == 0) message("Processed page ", i, " of ", n_pages)}# Reassemble all pages into a single textpage_files <- list.files(output_dir, pattern = "\\.txt$", full.names = TRUE)full_text <- sapply(page_files, readLines) |> unlist() |> paste0(collapse = " ") |> stringr::str_squish()``````{r multipage02, eval=FALSE, message=FALSE, warning=FALSE}# Alternative: parallel processing with furrr (requires the furrr package)# install.packages("furrr")library(furrr)future::plan(multisession, workers = 4) # use 4 CPU coresn_pages <- pdftools::pdf_info(large_pdf)$pagespage_texts <- furrr::future_map_chr( seq_len(n_pages), ~tesseract::ocr(large_pdf, engine = eng, pages = .x) |> paste0(collapse = " "), .progress = TRUE)full_text_parallel <- paste0(page_texts, collapse = " ")```::: {.callout-warning}## OCR Is SlowTesseract processes approximately 1–5 pages per minute depending on page resolution, image quality, page segmentation mode, and hardware. A 200-page scanned book may take 40 minutes to an hour. 
Always save intermediate results page-by-page (as shown above) so that you can resume without reprocessing completed pages if R crashes or times out.
:::

## Pre-Processing Images to Improve OCR Accuracy {-}

OCR accuracy depends heavily on image quality. For documents that produce poor results, pre-processing the page images before OCR can substantially improve recognition. The `magick` package (a wrapper around ImageMagick) provides the tools most commonly needed.

```{r preprocess, eval=FALSE, message=FALSE, warning=FALSE}
# install.packages("magick")
library(magick)

improve_ocr <- function(pdf_path, engine = eng) {
  # Convert PDF pages to high-resolution PNG images
  imgs <- magick::image_read_pdf(pdf_path, density = 300) # 300 dpi
  page_texts <- sapply(seq_along(imgs), function(i) {
    img <- imgs[i] |>
      magick::image_convert(type = "Grayscale") |>  # convert to greyscale
      magick::image_contrast(sharpen = 1) |>        # enhance contrast
      magick::image_despeckle() |>                  # remove noise
      magick::image_deskew(threshold = 40)          # straighten tilted pages
    # Run OCR on the pre-processed image
    tesseract::ocr(img, engine = engine)
  })
  paste0(page_texts, collapse = " ")
}

# Apply to a scanned PDF
clean_text <- improve_ocr("path/to/noisy_scan.pdf", engine = eng)
```

::: {.callout-tip}
## Most Impactful Pre-Processing Steps
In order of typical impact on OCR accuracy:

1. **Resolution** — scan/render at 300 dpi minimum; 400–600 dpi for small or degraded fonts
2. **Deskew** — correct page rotation introduced during scanning
3. **Greyscale conversion** — remove colour information that can confuse character detection
4. **Contrast enhancement** — improve separation between ink and background
5. **Despeckle** — remove noise from scanner sensors or damaged paper
:::

---

# Spell Checking and Correction {#spellcheck}

::: {.callout-note}
## Section Overview
**What you'll learn:** How to check OCR output for non-dictionary words using `hunspell`, how to generate and apply automated spelling suggestions, and how to identify and review the most frequent OCR errors
:::

Even high-quality OCR produces errors — especially for degraded documents, unusual fonts, or non-standard layouts. Common OCR error patterns include: `l` mistaken for `1` or `I`, `rn` mistaken for `m`, `cl` mistaken for `d`, and hyphenated line-break artefacts. Automated spell-checking cannot catch all errors (particularly proper nouns, technical terms, or correctly spelled but contextually wrong words), but it is a fast and effective first pass for cleaning OCR output.

## Tokenising and Checking Spelling {-}

`hunspell::hunspell_parse()` splits text into word tokens. `hunspell::hunspell_check()` returns `TRUE` for each token that is found in the dictionary and `FALSE` for each token that is not.

```{r spell01, message=FALSE, warning=FALSE}
# Tokenise OCR output into word vectors (one vector per document)
tokens_ocr <- lapply(ocrs, function(x) {
  hunspell::hunspell_parse(x, dict = hunspell::dictionary("en_US"))[[1]]
})

# How many tokens per document?
sapply(tokens_ocr, length)
```

```{r spell02, message=FALSE, warning=FALSE}
# Check which tokens are in the dictionary
spelling_check <- lapply(tokens_ocr, function(toks) {
  data.frame(
    token = toks,
    correct = hunspell::hunspell_check(toks, dict = hunspell::dictionary("en_US"))
  )
})

# Proportion of correctly spelled tokens per document
sapply(spelling_check, function(x) round(mean(x$correct) * 100, 1))
```

## Reviewing the Most Frequent Errors {-}

Before applying any automated correction, it is worth inspecting the most frequent non-dictionary tokens.
Many will be proper nouns, technical terms, or hyphenated compounds that are perfectly correct — these should be added to an ignore list rather than corrected.

```{r spell03, message=FALSE, warning=FALSE}
# Collect all non-dictionary tokens across all documents
all_errors <- lapply(spelling_check, function(x) {
  x$token[!x$correct]
}) |>
  unlist()

# Frequency table of the 20 most common non-dictionary tokens
error_freq <- sort(table(all_errors), decreasing = TRUE)

data.frame(
  token = names(error_freq),
  count = as.integer(error_freq)
) |>
  head(20) |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .5, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption(
    caption = "20 most frequent non-dictionary tokens across all OCR outputs."
  ) |>
  flextable::border_outer()
```

## Generating Spelling Suggestions {-}

`hunspell::hunspell_suggest()` returns a list of candidate corrections for each non-dictionary token, ranked by edit distance from the input. We take the first (best) suggestion where one is available.

```{r spell04, message=FALSE, warning=FALSE}
# Get suggestions for the 20 most common errors
top_errors <- names(error_freq)[1:20]
suggestions <- hunspell::hunspell_suggest(
  top_errors,
  dict = hunspell::dictionary("en_US")
)

# Build a review table: original token + best suggestion
suggestion_df <- data.frame(
  token = top_errors,
  suggestion = sapply(suggestions, function(s) {
    if (length(s) == 0) NA_character_ else s[1]
  })
)

suggestion_df |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .5, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption(
    caption = "Top 20 non-dictionary tokens with best hunspell correction suggestion."
  ) |>
  flextable::border_outer()
```

::: {.callout-warning}
## Always Review Before Applying
Automated suggestions should be reviewed before being applied.
`hunspell_suggest()` picks candidates purely based on character edit distance — it has no knowledge of context and will frequently suggest plausible-looking but wrong corrections. For example, the OCR error `cornputer` might be correctly suggested as `computer`, but `rnodels` might be suggested as `models` or `noodles` with equal confidence. Always check the suggestion table manually and build a curated correction dictionary for your specific document type.
:::

## Applying a Curated Correction Dictionary {-}

The recommended workflow is to review the suggestion table, manually confirm or override each correction, and then apply the full set of corrections as a batch string replacement.

```{r spell05, message=FALSE, warning=FALSE}
# Define a curated correction dictionary after manual review
# (example entries — adjust based on your actual OCR errors)
correction_dict <- c(
  "cornputer"   = "computer",
  "languagc"    = "language",
  "analysls"    = "analysis",
  "iinguistics" = "linguistics",
  "processlng"  = "processing"
)

# Apply corrections to all OCR texts
apply_corrections <- function(text, dict) {
  for (wrong in names(dict)) {
    text <- stringr::str_replace_all(
      text,
      pattern = paste0("\\b", wrong, "\\b"),
      replacement = dict[[wrong]]
    )
  }
  return(text)
}

corrected_texts <- sapply(ocrs, apply_corrections, dict = correction_dict)
```

## Simple Automated Correction (Aggressive Mode) {-}

If you prefer a fully automated approach and are willing to accept some incorrect corrections in exchange for speed, the following pipeline replaces every non-dictionary token with the best available suggestion.
Use this with caution on documents containing technical vocabulary, proper names, or non-standard spellings.

```{r spell06, message=FALSE, warning=FALSE}
# Automated correction: replace every non-dictionary token with best suggestion
# WARNING: will incorrectly "correct" proper nouns and technical terms
clean_ocrtext <- sapply(tokens_ocr, function(toks) {
  correct <- hunspell::hunspell_check(toks, dict = hunspell::dictionary("en_US"))
  suggs <- hunspell::hunspell_suggest(toks[!correct], dict = hunspell::dictionary("en_US"))
  # Replace non-dictionary tokens with first suggestion (if available)
  toks[!correct] <- sapply(suggs, function(s) {
    if (length(s) == 0) NA_character_ else s[1]
  })
  # Remove tokens for which no suggestion was found
  toks <- toks[!is.na(toks)]
  paste0(toks, collapse = " ")
})
```

```{r tesout3, echo=FALSE, eval=TRUE, message=FALSE, warning=FALSE}
clean_ocrtext |>
  substr(1, 800) |>
  as.data.frame() |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .95, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption(
    caption = "First 800 characters of the automatically spell-corrected OCR output."
  ) |>
  flextable::border_outer()
```

---

# Putting It All Together {#workflow}

::: {.callout-note}
## Section Overview
**What you'll learn:** A complete, production-ready workflow function that selects the appropriate extraction method (pdftools or tesseract), extracts text, and optionally applies spell correction — all in a single call
:::

The code below wraps the full pipeline into a single reusable function.
It accepts a path to a PDF or a directory of PDFs, detects whether each file has an embedded text layer (and falls back to tesseract if not), and optionally applies spell correction.

```{r workflow01, eval=FALSE, message=FALSE, warning=FALSE}
#' Extract text from one or more PDFs, choosing the best method automatically
#'
#' @param path Path to a single PDF file or a directory containing PDFs
#' @param lang Tesseract language code (default: "eng")
#' @param spell_correct Apply automated spell correction to OCR output?
#' @param min_chars_per_page Minimum characters per page to consider text
#'   layer valid (below this, fall back to tesseract)
#' @return Named character vector of extracted texts
extract_pdf_text <- function(path,
                             lang = "eng",
                             spell_correct = FALSE,
                             min_chars_per_page = 50) {

  # Resolve input: single file or directory
  if (dir.exists(path)) {
    files <- list.files(path, pattern = "\\.pdf$", full.names = TRUE,
                        ignore.case = TRUE)
  } else if (file.exists(path)) {
    files <- path
  } else {
    stop("Path does not exist: ", path)
  }

  engine <- tesseract::tesseract(lang)

  results <- sapply(files, function(f) {

    # Try pdftools first; check whether the text layer is usable
    pages_raw <- pdftools::pdf_text(f)
    avg_chars <- mean(nchar(stringr::str_squish(pages_raw)))
    has_textlayer <- avg_chars >= min_chars_per_page

    if (has_textlayer) {
      message(basename(f), ": using pdftools (text layer detected)")
      txt <- pages_raw |>
        paste0(collapse = " ") |>
        stringr::str_squish()
    } else {
      message(basename(f), ": using tesseract (no usable text layer)")
      # Render pages to temporary PNGs, then OCR them
      # (tesseract::ocr() takes images, not PDFs)
      n <- pdftools::pdf_info(f)$pages
      pngs <- pdftools::pdf_convert(
        f, dpi = 300,
        filenames = file.path(tempdir(), sprintf("ocr_%04d.png", seq_len(n))),
        verbose = FALSE
      )
      txt <- tesseract::ocr(pngs, engine = engine) |>
        paste0(collapse = " ") |>
        stringr::str_squish()
      unlink(pngs)

      if (spell_correct) {
        toks <- hunspell::hunspell_parse(txt, dict = hunspell::dictionary("en_US"))[[1]]
        correct <- hunspell::hunspell_check(toks, dict = hunspell::dictionary("en_US"))
        suggs <- hunspell::hunspell_suggest(toks[!correct], dict = hunspell::dictionary("en_US"))
        toks[!correct] <- sapply(suggs, function(s) {
          if (length(s) == 0) NA_character_ else s[1]
        })
        toks <- toks[!is.na(toks)]
        txt <- paste0(toks, collapse = " ")
      }
    }
    return(txt)
  }, USE.NAMES = TRUE)

  names(results) <- tools::file_path_sans_ext(basename(files))
  return(results)
}

# --- Usage examples ----------------------------------------------------------

# Single file — auto-detect method
text1 <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/pdf0.pdf")

# Directory — auto-detect method for each file
texts_auto <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/")

# Directory — force spell correction for OCR fallback files
texts_corrected <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/",
                                    spell_correct = TRUE)

# Non-English document
text_de <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/german_report.pdf",
                            lang = "deu")
```

---

# Summary {#summary}

This how-to has covered the complete PDF-to-text workflow in R:

**Choosing a tool.** `pdftools` is the right choice for digitally generated PDFs with an embedded text layer — it is fast, requires no external dependencies beyond Poppler, and preserves the document's pagination and layout. `tesseract` is the right choice for scanned documents and image-based PDFs — it is slower but handles content that `pdftools` cannot access at all.

**Beyond basic extraction.** `pdftools` also provides document metadata (`pdf_info()`), page dimensions (`pdf_pagesize()`), and font information (`pdf_fonts()`), all of which are useful for provenance tracking and diagnosing encoding problems. `tesseract` supports over 100 languages via downloadable language models and exposes configuration parameters for page segmentation mode and OCR engine selection that can significantly improve accuracy on challenging documents.

**Pre-processing and spell correction.** For noisy scans, pre-processing the page images with `magick` (greyscale conversion, contrast enhancement, deskewing, despeckling) before OCR substantially improves recognition accuracy.
Post-OCR spell checking with `hunspell` identifies non-dictionary tokens and can generate correction candidates, but automated correction should always be reviewed manually before application — particularly for documents containing proper nouns, technical vocabulary, or non-standard spelling conventions.

**Production-ready workflow.** The `extract_pdf_text()` function presented in the final section wraps the full pipeline into a single call that automatically detects whether each PDF has a usable text layer and selects the appropriate extraction method accordingly.

---

# Citation and Session Info {-}

Schweinberger, Martin. 2026. *Converting PDFs to Text with R*. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/pdf2txt/pdf2txt.html (Version 2026.02.24).

```
@manual{schweinberger2026pdf2txt,
  author = {Schweinberger, Martin},
  title = {Converting PDFs to Text with R},
  note = {https://ladal.edu.au/tutorials/pdf2txt/pdf2txt.html},
  year = {2026},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
  address = {Brisbane},
  edition = {2026.02.24}
}
```

```{r session_info}
sessionInfo()
```

::: {.callout-note}
## AI Transparency Statement
This how-to was revised and substantially expanded with the assistance of **Claude** (claude.ai), a large language model created by Anthropic. Claude was used to restructure the document into Quarto format, add the pdftools vs tesseract comparison section, expand the pdftools section with metadata, page-level, and batch-processing examples, expand the tesseract section with language support and engine configuration, write the new multi-page scanned PDF and image pre-processing sections, expand the spell-checking section with a suggested-correction workflow and curated dictionary approach, and write the production-ready `extract_pdf_text()` wrapper function.
All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy of the material.
:::

---

[Back to top](#intro)

[Back to LADAL home](/)

---

# References {-}